
    A Lite Romanian BERT: ALR-BERT

    Large-scale pre-trained language representations and their promising performance on various downstream applications have become an area of interest in natural language processing (NLP). There has been considerable interest in further increasing model size in order to surpass the best previously reported results. However, beyond a certain point, increasing the number of parameters runs into the limited memory and compute capacity of available GPUs/TPUs. In addition, such models are mostly available either in English or as part of a shared multilingual model. Hence, in this paper we propose a lite BERT trained on a large corpus solely in the Romanian language, which we call “A Lite Romanian BERT (ALR-BERT)”. Based on comprehensive empirical results, ALR-BERT produces models that scale far better than the original Romanian BERT. Alongside presenting performance on downstream tasks, we detail the analysis of the training process and its parameters. We also intend to release our code and model as open source, together with the downstream tasks.
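    As an illustration of how such a pre-trained encoder is typically consumed, the sketch below loads a BERT-style model with the Hugging Face transformers library and produces contextual embeddings for a Romanian sentence. The model identifier "example-org/alr-bert" is a placeholder assumed for the example, not the actual ALR-BERT release path.

        # Minimal sketch: loading a Romanian BERT-style encoder with Hugging Face transformers.
        # "example-org/alr-bert" is a hypothetical identifier; substitute the released checkpoint.
        import torch
        from transformers import AutoTokenizer, AutoModel

        tokenizer = AutoTokenizer.from_pretrained("example-org/alr-bert")
        model = AutoModel.from_pretrained("example-org/alr-bert")

        inputs = tokenizer("Acesta este un exemplu de propoziție în limba română.",
                           return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        # outputs.last_hidden_state holds one contextual vector per sub-word token.
        print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size)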

    Natural Language Question Answering in Open Domains

    Abstract: With the ever-growing volume of information on the web, traditional search engines, which return hundreds or thousands of documents per query, place an increasing burden on users' patience in satisfying their information needs. Question Answering in Open Domains is a top research and development topic in current language technology. Unlike standard search engines, based on the latest Information Retrieval (IR) methods, open-domain question-answering systems are expected to deliver not a list of documents that might be relevant to the user's query, but a sentence or a paragraph answering the question asked in natural language. This paper reports on the construction and testing of a Question Answering (QA) system that builds on several web services developed at the Research Institute for Artificial Intelligence (ICIA/RACAI). The evaluation of the system was carried out independently by the organizers of the ResPubliQA 2009 exercise, which rated it the best-performing system, with the highest improvement over a baseline state-of-the-art IR system attributable to natural language processing technology. The system was trained on a specific corpus, but its functionality is independent of the linguistic register of the training data.
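    As a toy illustration of the passage-level idea contrasted with document retrieval above, the sketch below ranks candidate paragraphs by simple term overlap with the question and returns the single best paragraph. This is an assumed baseline for illustration only, not the ICIA/RACAI system described in the paper.

        # Toy sketch: return the best-matching paragraph instead of a list of documents.
        # The scoring is plain term overlap, far simpler than a real QA system.
        import re

        def terms(text):
            return set(re.findall(r"\w+", text.lower()))

        def best_paragraph(question, paragraphs):
            q = terms(question)
            # Score each paragraph by how many question terms it contains.
            scored = [(len(q & terms(p)), p) for p in paragraphs]
            return max(scored)[1]

        paragraphs = [
            "The institute was founded to advance research in artificial intelligence.",
            "ResPubliQA 2009 evaluated question answering systems over European legislation.",
        ]
        print(best_paragraph("What did ResPubliQA 2009 evaluate?", paragraphs))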

    Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging

    The paper presents one way of reconciling data sparseness with the requirement of high-accuracy tagging using fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements, which therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1,000 morpho-syntactic description codes (MSDs), which, after accounting for some systematic syncretic phenomena, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require an unrealistically large amount of hand-annotated/validated training data. Our solution was to design a hidden, reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as there are LMs. The tag differences between these variants are processed by a combiner, which chooses the most likely tags. In the end, the tagged text is subject to a conversion process that maps the tags from the reduced tagset onto the more informative tags of the large tagset. We describe this processing chain and provide a detailed evaluation of the results.
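    A hedged sketch of the tiered-tagging pipeline described above: several LM variants tag the same sentence with the reduced (hidden) tagset, a combiner takes a majority vote per token, and a final conversion step maps each reduced tag back to a full MSD via a lexicon-style lookup. All words, tags, and mappings below are invented for illustration; the real system uses the 614-MSD tagset and a lexicon-driven conversion.

        # Illustrative sketch of tiered tagging with several LMs and a combiner.
        from collections import Counter

        # Each LM variant proposes a reduced tag per token of the same sentence.
        variant_taggings = [
            ["Np", "V", "Nc"],   # tagging produced with LM 1
            ["Np", "V", "Nc"],   # tagging produced with LM 2
            ["Nc", "V", "Nc"],   # LM 3 disagrees on the first token
        ]

        def combine(variants):
            # Majority vote per token position over the reduced tags.
            return [Counter(tags).most_common(1)[0][0] for tags in zip(*variants)]

        # Conversion step: recover the full MSD from the word and its reduced tag,
        # as a lexicon lookup would do (values here are purely illustrative).
        reduced_to_msd = {
            ("Ion", "Np"): "Np",
            ("citește", "V"): "Vmip3s",
            ("cartea", "Nc"): "Ncfsry",
        }

        words = ["Ion", "citește", "cartea"]
        reduced = combine(variant_taggings)
        msds = [reduced_to_msd.get((word, tag), tag) for word, tag in zip(words, reduced)]
        print(list(zip(words, reduced, msds)))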